This document is an introduction to information-geometric techniques for geospatial analysis. The core situation we consider is one in which our data consist of counts in two or more categories, distributed in space. A standard example is demographic census data, in which the categories might be racial groups, and the spatial units tracts. This document does not focus on mathematical details of the methods, which will be explained in a forthcoming paper. Rather, the focus is on the intuition behind various analytical methods, and how to carry them out in order to learn about spatial separation with freely-available software tools. The primary methods we use are implemented in the packages compx for the R programming language. Get compx by running:
# install.packages('devtools') if necessary
devtools::install_github('PhilChodrow/compx')
library(compx)To prepare data and analyze outputs of compx functions, we will also need the following packages:
library(tidyverse)
library(maptools)
library(igraph)
library(RColorBrewer) # for plotsThe functions provided by package compx assume that your data is expressed in two components. The first is a SpatialPolygonsDataFrame (spdf) containing geographic polygons (“tracts”). For illustrative purposes, compx comes bundled with an spdf with the Census tracts for Wayne County, Michigan, which includes the urban core of Detroit, an oft-analyzed city in quantitative segregation studies. The tracts were originally accessed via the R package tigris.
detroit_tracts %>% plot()The second data input must be a demographic table with class in c('tbl_df', 'tbl', 'data.frame'), which I will refer colloquially to as a “data frame.” compx includes an example in detroit_race, giving racial demographic counts for each tract for decennial Censuses 1970, 1980, 1990, 2000, 2010. Due to changing questions and collection methods, only 1990 data and later is comparable with 2010. The demographic data was assembled and normalized by the Project on Diversity and Disparities at Brown University.
detroit_race## # A tibble: 15,275 × 4
## t group n tract
## <int> <chr> <dbl> <chr>
## 1 1970 Asian 6 26163500100
## 2 1970 Black 0 26163500100
## 3 1970 Hispanic 0 26163500100
## 4 1970 Other 0 26163500100
## 5 1970 White 0 26163500100
## 6 1980 Asian 23 26163500100
## 7 1980 Black 2 26163500100
## 8 1980 Hispanic 32 26163500100
## 9 1980 Other 1 26163500100
## 10 1980 White 4175 26163500100
## # ... with 15,265 more rows
Note that detroit_race contains one row per combination of tract, time, and racial group. compx expects your demographic data to be in this tidy (or “long”) format. If it’s in a different shape, you may want to investigate the dplyr and tidyr packages to format it appropriately.
Three columns are required:
tract, the key relating the data the corresponding spdf. compx assumes that tract matches the GEOID column of spdf@data.group, the demographic category (such as racial group, in this case).n, the count of members of group in each tract.Additionally, you may include an optional column for time t. compx functions will automatically use this column of detected; if you don’t want to do temporal analysis, you should delete the \(t\) column if you have one. Any additional columns are ignored.
Before going any further, let’s do a bit of cleaning to ensure that we are only analyzing with tracts for which there is a reasonable amount of data available.
tracts_to_use <- detroit_race %>%
group_by(tract, t) %>%
summarise(n = sum(n)) %>%
filter(n > 100)
tracts <- detroit_tracts[detroit_tracts@data$GEOID %in% tracts_to_use$tract, ]
f_tracts <- tracts %>% # for plotting later
fortify(region = 'GEOID')Our analytical approach is motivated by information geometry. The core notion of information geometry is to view the data as lying on an information manifold, and then use some simple tools of differential geometry to study that manifold’s properties.
The fundamental manifold property is the metric tensor \(g\). The metric tensor contains complete information on how distances in information space are bent and curved, depending on where we are. It may therefore be used to compute distances. If \(p, q, r \in M\), then, provided \(p,q,\) and \(r\) are sufficiently close, the geodesic distance between \(q\) and \(r\) is approximately
\[ d(q, r) \approx \sqrt{g_p(q - r, q - r)}\;.\]
When \(g = I\), the identity matrix, this formula reduces to \(d(q,r) \approx \lVert q-r \rVert\), the Euclidean distance. When \(g\) differs substantially from \(I\), we get more interesting behavior.
We can view the metric tensor \(g\) as encoding the information we need to measure distances based on information, rather than on geography alone. A related and important idea of the metric tensor is that it encodes variation – places where the components of the metric tensor are large correspond to boundary areas between different demographic clusters.
Because the metric tensor is fundamental to our approach, compx includes a function that computes it directly from tracts and appropriately keyed data. If the keyed data contains no column labeled t, then compx assumes that all data are from the same time period. The simplest case looks like this:
race_2010 <- detroit_race %>%
filter(t == 2010) %>%
select(-t) # remove the t column if not using it
metric_df <- compute_metric(tracts, # use these tracts
race_2010, # and these data
km = T, # convert spatial units to km
sigma = 100, # for numerical derivatives
divergence = 'euc') # use the Euclidean distanceIn addition to the tracts and data parameters, compute_metric also allows users to specify whether length should be measured in km or degrees long/lat; the divergence function to use when comparing distributions, and sigma, a technical parameter relating to the computation of numerical derivatives required to compute the metric tensor.
The output of compute_metric is a data frame, keyed by geoid, which gives the total population of the tract as well as a list-column with the metric tensor in local coordinates. Since we didn’t give a time column, there are only two local coordinates x and y.
metric_df %>% head()## # A tibble: 6 × 3
## geoid total local_g
## <chr> <dbl> <list>
## 1 26163500100 4032 <dbl [2 × 2]>
## 2 26163500200 3201 <dbl [2 × 2]>
## 3 26163500300 3130 <dbl [2 × 2]>
## 4 26163500400 1659 <dbl [2 × 2]>
## 5 26163500500 2205 <dbl [2 × 2]>
## 6 26163500600 3703 <dbl [2 × 2]>
We can also take a quick peek at the metric tensor itself. For example, the tensors corresponding to the first two entries of metric_df are:
metric_df$local_g %>% head(2)## [[1]]
## [,1] [,2]
## [1,] 0.02862481 -0.01847180
## [2,] -0.01847180 0.01194856
##
## [[2]]
## [,1] [,2]
## [1,] 0.0002943192 -0.0000148503
## [2,] -0.0000148503 0.0015038466
The metric tensor is a symmetric matrix. Roughly, \(g_{ii}\) encodes the strength of the dependence on the frequency field on the \(i\)th coordinate. Thus, if we focus on
metric_df$local_g[[1]]## [,1] [,2]
## [1,] 0.02862481 -0.01847180
## [2,] -0.01847180 0.01194856
we can compare the entry of 0.0286248 in the x component to 0.0119486 for the y component to get a sense for the direction in which the demographic distribution is changing most rapidly. In this case, it’s the \(x\) direction.
To visualize the metric structure, it’s usually necessary to apply a scalar function to the metric tensor. Good ones are the trace and the square root of the determinant, both of which are usefully interpretable from the information-geometric point of view. Roughly, the trace corresponds to length distortions, and the determinant to volume distortions in information space.
trace_df <- metric_df %>%
mutate(trace = map_dbl(local_g, . %>% diag() %>% sum()),
det = map_dbl(local_g, . %>% det() %>% abs() %>% sqrt()))
trace_df %>% head()## # A tibble: 6 × 5
## geoid total local_g trace det
## <chr> <dbl> <list> <dbl> <dbl>
## 1 26163500100 4032 <dbl [2 × 2]> 0.0405733630 9.042476e-04
## 2 26163500200 3201 <dbl [2 × 2]> 0.0017981658 6.651243e-04
## 3 26163500300 3130 <dbl [2 × 2]> 0.0057983515 1.327444e-03
## 4 26163500400 1659 <dbl [2 × 2]> 0.0020904304 3.793111e-04
## 5 26163500500 2205 <dbl [2 × 2]> 0.0002271561 1.943405e-05
## 6 26163500600 3703 <dbl [2 × 2]> 0.0011665347 1.243781e-04
It’s useful to plot these scalars to see how they behave visually. Here’s the trace.
f_tracts %>%
left_join(trace_df, by = c('id' = 'geoid')) %>%
ggplot() +
aes(x = long, y = lat, fill = trace, group = group) +
geom_polygon() +
viridis::scale_fill_viridis(option = 'magma', trans = 'log10') +
ggthemes::theme_map() One way to think about the more intense areas with higher traces is that they are areas of transition. If you compare this visualization to, e.g. UVA’s Racial Dot Map you’ll find that the lighter areas correspond to the predominating white/black boundary, as well as smaller subdivisions.
compx tries to make it easy for you to incorporate time into your analysis as well. To do this, you just need to make sure that your data has a t column with appropriate integer or numeric values. Then, compute_metric will work just as before.
race_temporal <- detroit_race %>%
filter(t %in% c(1990, 2000, 2010)) # now we don't remove the t column
metric_df <- compute_metric(tracts,
race_temporal,
km = T,
sigma = 100,
divergence = 'euc')This version of metric_df is different in two respects:
metric_df %>% head()## # A tibble: 6 × 4
## geoid total local_g t
## <chr> <dbl> <list> <int>
## 1 26163500100 3878 <dbl [3 × 3]> 1990
## 2 26163500100 4304 <dbl [3 × 3]> 2000
## 3 26163500100 4032 <dbl [3 × 3]> 2010
## 4 26163500200 3057 <dbl [3 × 3]> 1990
## 5 26163500200 3341 <dbl [3 × 3]> 2000
## 6 26163500200 3201 <dbl [3 × 3]> 2010
First, there’s a t column for time. Second, the metric tensor is now 3 x 3, which reflects the fact that we’ve added a third coordinate (time).
metric_df$local_g[[1]]## [,1] [,2] [,3]
## [1,] 0.01286933 0.02589698 -0.010160315
## [2,] 0.02589698 0.05217391 -0.020459977
## [3,] -0.01016032 -0.02045998 0.008025002
In general, it’s not a good idea to directly compare temporal components of the metric tensor to spatial ones, since they have different units. In this case, the spatial components have units of kilometers, but the temporal ones have units of years. However, it’s not wrong to interpret this result as saying that, at this point in space and time, the demographic composition changes more when you move a km in space than when you move a year in time.
Just like before, we can apply some scalar functions to extract information about the metric tensor. The spatial trace is the sum of the first two diagonal entries, corresponding to the components of the metric tensor that encode spatial (not temporal) variability.
trace_df <- metric_df %>%
mutate(trace = map_dbl(local_g, ~ .[1:2, 1:2] %>% diag() %>% sum()))
f_tracts %>%
left_join(trace_df, by = c('id' = 'geoid')) %>%
ggplot() +
aes(x = long, y = lat, fill = trace, group = group) +
geom_polygon() +
viridis::scale_fill_viridis(option = 'magma', trans = 'log10') +
ggthemes::theme_map() +
facet_wrap(~t)The third component of the metric tensor quantifies the dependence on time. We can also extract that and visualize it:
t_df <- metric_df %>%
mutate(temporal = map_dbl(local_g, ~ .[3, 3]))
f_tracts %>%
left_join(t_df, by = c('id' = 'geoid')) %>%
ggplot() +
aes(x = long, y = lat, fill = temporal, group = group) +
geom_polygon() +
viridis::scale_fill_viridis(option = 'magma', trans = 'log10') +
ggthemes::theme_map() +
facet_wrap(~t)We can see that, the predominantly white suburbs toward the west have generally experienced greater demographic change over time than the prodominantly black urban core toward the northeast. We also observe that the temporal component of the metric tensor tends to correlate with the spatial component of the previosu visualization. So, spatial transition areas also tend to be temporal transition areas. This makes sense if you think that the dynamics of population composition are continuous in space, so that change happens at the boundaries.
t_df %>%
mutate(spatial = map_dbl(local_g, ~ .[1:2, 1:2] %>% diag() %>% sum())) %>%
ggplot() +
aes(x = spatial, y = temporal) +
geom_point() +
scale_y_continuous(trans = 'log') +
scale_x_continuous(trans = 'log') +
geom_smooth() +
facet_wrap(~t)The other main form of analysis enabled by compx is network-based analysis used to quantify and identify structure in demographic variation.
g <- construct_information_graph(tracts, race_2010, divergence = 'DKL')
g## IGRAPH DN-- 606 1516 --
## + attr: name (v/c), x (v/n), y (v/n), dist (e/n)
## + edges (vertex names):
## [1] 26163526400->26163526500 26163526200->26163526500
## [3] 26163526300->26163526500 26163534600->26163526500
## [5] 26163985000->26163526500 26163573600->26163526500
## [7] 26163576000->26163574202 26163574300->26163574202
## [9] 26163574100->26163574202 26163575400->26163574202
## [11] 26163546500->26163546600 26163546100->26163546600
## [13] 26163546700->26163546600 26163546300->26163546600
## [15] 26163540900->26163541500 26163541700->26163541500
## + ... omitted several edges
g is an igraph object with edge and node attributes. The dist edge attribute reflects how information-geometrically dissimilar are the connected nodes, where this dissimilarity is just the information distance between the nodes, calculated using the metric tensor. It’s useful to visualize this network. In the plot below, dissimilar tracts have thin edges between them, while very similar ones are joined by thick edges.
edges <- g %>% as_long_data_frame() %>% tbl_df()
nodes <- data_frame(x = V(g)$x, y = V(g)$y)
ggplot() +
geom_polygon(aes(x = long, y = lat, group = group), fill = 'firebrick', alpha = .6, data = f_tracts) +
geom_segment(aes(x = from_x,
y = from_y,
xend = to_x,
yend = to_y,
size = exp(-7 * dist^2)), color = 'black', data = edges) +
ggthemes::theme_map() +
scale_size_continuous(range = c(0,1)) +
guides(alpha = 'none', color = 'none', size = 'none')Comparing to the figures above, you might notice that areas with thin or invisible edges correspond to the same “border” areas that we saw highlighted by the spatial trace above.
One useful way to analyze the spatial structure of segregation is via the eigenspectrum of the Laplacian matrix corresponding to the graph, a method inspired by spectral clustering. In the plot below, large gaps in the spectrum correspond to strong “signals” in the spatial structure; intiutively, it “makes sense” to cut the structure into that many pieces.
A <- affinity_matrix(g, sigma = 1) # square exponential affinity with specified sigma
L <- generalized_laplacian_matrix(A) # L_{rw} in the tutorial cited above
evL <- eigen(L, symmetric = T) # compute the spectrum
data.frame(n = 1:30, ev = 1 - rev(evL$values)[1:30]) %>%
ggplot() +
aes(x = n, y = ev) +
geom_line() +
geom_point() +
scale_y_continuous(trans = 'log10') +
geom_vline(xintercept = 2.5, linetype = 'dashed') +
geom_vline(xintercept = 7.5, linetype = 'dashed') +
geom_vline(xintercept = 10.5, linetype = 'dashed') +
theme_bw()For example, we see that in Detroit with under the KL divergence, there’s a large gap after \(n = 2\) and then a smaller one at \(n = 7\). Of course, it’s important to realize that there is some judgment required in identifying these hard cutoffs – it’s not a single cutoff but the spectrum of \(A\) that most fully describes the community structure.
We can try to actually identify clusters based on the affinity matrix \(A\). There are lots of approaches for finding clusters in an affinity matrix, but a simple and workable one is hierarchical clustering.
# do the clustering
clust <- hclust(as.dist(-A), method = 'complete')
# assign node attributes to g giving the clusters
ncluster <- 7
g <- g %>% set_vertex_attr(name = 'cluster',
index = colnames(A),
value = clust %>% cutree(ncluster))
# extract as data frame. We already extracted the edges df above.
nodes <- data_frame(x = V(g)$x, y = V(g)$y, cluster = V(g)$cluster)
ggplot() +
geom_segment(aes(x = from_x,
y = from_y,
xend = to_x,
yend = to_y,
size = exp(-7 * dist^2)), color = 'black', data = edges) +
geom_point(aes(x = x, y = y, color = as.character(cluster)), size = 2, data = nodes) +
ggthemes::theme_map() +
scale_size_continuous(range = c(0,1)) +
guides(alpha = 'none', color = 'none', size = 'none') +
scale_color_brewer(palette = 'Set1')This is a reasonably strong result; the clustering has detected the predominating division between the white suburbs (green) and the predominantly black urban core (blue), as well as “Mexicantown” (red), the historically distinct community of Grosse Point (yellow), and Hamtramck (brown), an independent town with large European and Asian immigrant populations. Not all clusterings I’ve worked with have quite such intuitive behavior, but all demonstrate promising performance.
It’s also possible to use construct networks at multiple time-slices. Just like before, it’s necessary to use data that has a t column with appropriate values.
g <- construct_information_graph(tracts, race_temporal, divergence = 'DKL')When t column is provided, the names of g are concatenations of the GEOID of the tract and the corresponding value of \(t\). There’s an edge between each tract and its corresponding step forward or backward in time.
g## IGRAPH UN-- 1818 5760 --
## + attr: name (v/c), geoid (v/c), t (v/n), x (v/n), y (v/n), dist
## | (e/n)
## + edges (vertex names):
## [1] 26163526400_1990--26163526500_1990 26163526200_1990--26163526500_1990
## [3] 26163526300_1990--26163526500_1990 26163534600_1990--26163526500_1990
## [5] 26163985000_1990--26163526500_1990 26163573600_1990--26163526500_1990
## [7] 26163576000_1990--26163574202_1990 26163574300_1990--26163574202_1990
## [9] 26163574100_1990--26163574202_1990 26163575400_1990--26163574202_1990
## [11] 26163546500_1990--26163546600_1990 26163546100_1990--26163546600_1990
## [13] 26163546700_1990--26163546600_1990 26163546300_1990--26163546600_1990
## + ... omitted several edges
We can also cluster on these temporal networks, in just the same fashion as we did before.
# construct the affinity matrix and do clustering
A <- affinity_matrix(g, sigma = 1)
sigma <- 1
clust <- hclust(as.dist(-A), method = 'complete')
# assign clusters as a node attribute
ncluster <- 12
g <- g %>% set_vertex_attr(name = 'cluster',
index = colnames(A),
value = clust %>% cutree(ncluster))It’s more complex to visualize these networks in a workable way, but not hard to visualize the clusters over time:
# get the edges, only including ones that are "within" a time slice.
edges <- g %>%
as_long_data_frame() %>%
tbl_df() %>%
mutate(type = ifelse(from_t == to_t, 'spatial', 'temporal')) %>%
filter(type == 'spatial') %>%
mutate(t = as.integer(stringr::str_sub(from_t, -5)))
# get the nodes, including temporal and cluster info
nodes <- data_frame(x = V(g)$x,
y = V(g)$y,
t = V(g)$t,
cluster = V(g)$cluster)
# plot the clusters, faceting on time
pal <- brewer.pal(9, name = "Set1") %>%
colorRampPalette()
ggplot() +
geom_segment(aes(x = from_x,
y = from_y,
xend = to_x,
yend = to_y,
size = exp(-7 * dist^2)), data = edges, color = 'grey40') +
geom_point(aes(x = x, y = y, color = as.character(cluster)), size = 1, data = nodes) +
facet_wrap(~t) +
scale_size_continuous(range = c(0,1)) +
ggthemes::theme_map() +
guides(color = FALSE, size = FALSE) +
scale_color_manual(values = pal(ncluster)) Currently, this temporal clustering method has some limitations – tracts don’t change demographics very quickly over the time period we are considering, and so clusters tend to be quite persistent. This in itself isn’t a bad thing per se, but it’s also the case that we generate some slightly counterintuitive clusters this way. Getting useful clusterings is a major area of focus ahead.
However, there’s also a substantial amount of signal in the clusters. For example, the clustering detects that the historically white community of Grosse Point (the far eastern tip) has somewhat receded from 1990 to 2010, with its western-most areas being “absorbed” into the predominantly black cluster. This makes sense when we visualize how relative percentages of White and Black residents have changed over time:
# construct a data frame of percentages.
percent_df <- race_temporal %>%
filter(t %in% c(1990, 2010)) %>%
group_by(tract, t) %>%
mutate(percent = n /sum(n)) %>%
rename(race = group) %>%
filter(race %in% c('Black', 'White'))
# plot the df
pal <- brewer.pal(9, 'Blues') %>%
colorRampPalette()
f_tracts %>%
left_join(percent_df, by = c('id' = 'tract')) %>%
tbl_df() %>%
ggplot() +
aes(x = long, y = lat, group = group, fill = percent) +
geom_polygon() +
facet_grid(t ~ race) +
ggthemes::theme_map() +
scale_fill_distiller(palette = 'BuPu', direction = 1) +
theme(panel.background = element_rect(fill = 'grey80'))Priorities for future work on compx include refining clustering methods, performance enhancements, and improvements TBD based on suggestions from those in the planning and sociological communities.